In jefftwebb/lightbulb: Tutorials for Learning Data Science

library(learnr)
knitr::opts_chunk$set(echo = FALSE, warning = F, message = F)
library(tidyverse)
library(arm)
library(janitor)
library(caret)
library(gridExtra)


sales <- read_csv("sales.csv") %>% 
  clean_names() %>% 
  mutate(price = (price * 4) %>%  round(2),
         coupon = coupon+displaycoupon,
         coupon = factor(ifelse(coupon == 0, "no", "yes"), levels = c("no","yes")),
         display = display+displaycoupon,
         display = factor(ifelse(display ==0, "no", "yes"), levels = c("no","yes")),
         sales = ifelse(price < 4, sales + 100,
                        ifelse(price < 4.5 & price > 4, sales+ 50, sales)) %>%  round(2)) %>% 
  dplyr::select(-obs, -displaycoupon)


model <- lm(sales~price, data = sales)

newsales <- filter(sales, sales < 1000)

rmse <- function(actual, fitted) sqrt(mean((actual - fitted)^2))

newsales$price2 <- newsales$price * newsales$price

newsales$price_centered <- newsales$price - mean(newsales$price)

log_multiple <- lm(log(sales) ~ price + 
                      coupon +
                      display, data = newsales)

new_multiple <- lm(sales ~ price + 
                      coupon +
                      display, data = newsales)

multiple <- lm(sales ~ price + 
                      coupon +
                      display, data = sales)

new_simple <- lm(sales ~ price, data = newsales)

simple <- lm(sales ~ price, data = sales)

caret_lm <- train(sales ~ price2 + display + coupon,
                  data = newsales,
                  method = "lm")

caret_knn <- train(sales ~ price+ display + coupon,
                  data = newsales,
                  preProcess = c("center", "scale"),
                  method = "knn")


poly <- lm(sales ~ price + I(price^2) + coupon + display,
   data = newsales)

Introduction

This tutorial focuses on using multiple linear regression to analyze the market response problem we investigated in the last tutorial. For review, here is a plot of sales ~ price with the least squares regression line:

# Visualize sales ~ price
ggplot(sales, aes(price, sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  theme_minimal() +
  labs(title = "Sales ~ price")

The plot suggests there is a negative relationship between sales and price. We tested the slope of that line using simple linear regression and discovered that it is statistically significant. However, as we noted, the linear fit was not great. Not only are there bumps in sales in certain price regions that are not well-explained by price alone, but the general relationship between price and sales is more curved than linear. We can picture this non-linearity using a LOESS (or local regression) curve:

# Visualize sales ~ price
ggplot(sales, aes(price, sales)) +
  geom_point() +
  geom_smooth(method = "loess", se = F) +
  theme_minimal() +
  labs(title = "Sales ~ price")

Our aim here is to analyze additional factors contributing to sales, and, in doing so, to demonstrate some of the techniques discussed in the lecture videos as well as to equip you with the skills needed for this module's case. We will focus on the following (not necessarily in this order):

Multiple predictors
Outliers
Interactions
Centering and scaling inputs
Polynomial regression
Model comparison

Fitting the Model

Other variables are available to add to the model:

# Look at the data
glimpse(sales)
summary(sales)

price is the only numeric predictor; the others---display and coupon---are binary factor variables.

Note that a binary factor could also be represented as a numeric indicator variable, coded as 0 or 1. Such indicator variables are also known as dummy variables, with 0 indicating the absence of a categorical effect and 1 indicating the presence. If the categorical variable has just two levels then one dummy variable column will suffice, as here; but with more levels we need more columns to capture all of the information---equal to the number of factor levels minus one. If there were three possible categorical values, for example, then we would need two dummy variable columns. Having a column for every level of a variable---say, two columns for display---would be redundant, since all the information is included in just one column. Trying to use a column for every level creates what is known as the dummy variable trap, and will create problems for linear regression. (Representing categorical variables with dummy variables is also known as "one-hot encoding.")

We'll start out by fitting a model with all the predictor variables and compare it to the simple linear model.

# Fit simple linear model
(simple <- lm(sales ~ price, data = sales)) %>% 
  summary

# Fit multiple linear model
(multiple <- lm(sales ~ price + display  + coupon, data = sales)) %>% 
  summary

How do we compare? The simple linear regression model had an $R^2$ of close to 0, meaning that the regression line explained almost no more of the variation in sales than did the mean of sales. By that measure, the multiple linear regression model, with higher $R^2$, is clearly better. This is an informal comparison. We can compare the two models formally using an F-test implemented in the anova() function.

# Compare models
anova(simple, multiple)

The null hypothesis for this model comparison, like most statistical tests, is: no difference. In this case, the statistically significant result, and large F value, suggests that the observed difference between the models is not due to chance---random variation in sampling. Thus, we can reject the null hypothesis of no difference. Comparing models in this way using anova()---technically this is called a likelihood ratio test or LRT---can be used only with "nested models" of the same outcome, models that include overlapping sets of predictors. For example, y ~ x1, y ~ x1 + x2, and y ~ x1 + x2 + x3 are nested models. LRT allows us to compare such models for fit, for how well they describe the data. Clearly, both coupon and display are important predictors since the multiple regression model, which includes them, fits the data better than the simple linear regression, which does not. What if we removed price from the multiple regression model? Would that make the model worse?

# Compare models
anova(multiple, lm(sales ~ coupon + display, data = sales))

Yes, removing price makes the model worse, statistically speaking, suggesting that it is a statistically significant contributor to the model, above and beyond the explanatory work done by coupon and display. We will consider later whether a quadratic form of price would be an even stronger predictor. (By "quadratic form" I mean that we include a term representing price x price or $price^2$ in the model to capture the curved relationship between price and sales.)

Outliers

This dataset contains a large, single outlier that has an outsized impact on the coefficient estimates and model fit. Notice the difference in R-squared between the model with and without the outlier.

# Model with outlier
multiple %>% 
  summary

# Model without outlier
(new_multiple <- lm(sales ~ price + display  + coupon, 
   data = filter(sales, sales < 1000))) %>% 
  summary

The model fit, measured by $R^2$ is dramatically better without the outlier, and the coefficient estimates are, likewise, dramatically different in the two models. This is an overly influential point, which is apparent in the residual plot. The lm() function automatically creates 4 diagnostic plots. The which = 1 argument to plot() picks out the first, a residual plot.

# Residual plot
plot(multiple, which = 1)

This plot labels the outliers by row number. The culprit is clearly visible in the upper right: row 32. In general, outliers are not merely anomalous observations in the raw data; instead they are observations that produce large residuals, that are not explained well by the model. (An outlier in the raw data that does not turn into a large residual---that, in other words, the model does a good job of explaining---is not one we typically worry about.) Let's take a closer look at this row containing the outlier.

# Look at outlier
filter(sales, sales > 1000)

This sales number was recorded in a week that included both display and coupon promotions, so a higher sales volume was perhaps to be expected. But that high? The top 5 sales volumes provide a sense of the magnitude of the outlier:

# Look at outlier
arrange(sales, desc(sales)) %>% 
  head(5)

This observation is almost three times larger than the next largest sales volume. It may be a data collection error---a mistake---or the result of some other influence on sales that is unrecorded in this data set, such as a holiday.

What should we do?

In general, outliers should not be reflexively removed from a data set. Simply because the model fit is better without an outlier is not a reason to remove it! Instead:

Ask questions about data collection. Is the outlier a mistake, the result of an error?
Ask questions of domain experts to understand other factors influencing beer sales. Could a variable be added, or created, that would explain cases of high sales volume?

Unfortunately, adding or creating a variable would not work in the case of this single outlier because a single point is not a pattern, and models are designed to discover patterns. In the absence of further information it makes sense to remove the outlier. One sensible strategy in such situations would be to report results both with and without the outlier. In this tutorial, subsequent modeling will be done without the outlier using a filtered dataset titled newsales. Incidentally, there are formal checks for identifying outliers (for example, using Cook's distance) but I think it makes more sense to evaluate observations producing large residuals on a case by case basis rather than mechanically.

Model Interpretation

A multiple linear regression model can be used for causal inference---with caution---since each coefficient represents the change in the outcome associated with a one-unit change in the predictor, while holding the other variables fixed. In other words, the multiple regression model allows us to focus on the explanatory value of each variable independent of the others. It allows us to focus on the unique effect of a single variable after having removed the others as possible explanations. If you have been careful to include in the regression variables representing other potential explanations of the outcome, then the model, though it registers only correlations, can nevertheless be used---carefully---to make a causal argument. More on this below.

Intercept

The intercept is the fitted value of sales when the numeric predictors are 0 and the factor variable inputs are at their reference levels. In this case notice that price, though numeric, will never equal 0: the store will not give beer away. To make the intercept interpretable we would need to center price at its mean. Then the intercept would represent the fitted value of sales when price is average (equals 0) and there is no promotion.

How do we center a variable? Simply subtract the mean from every individual observation:

# Center price
newsales$price_centered <- newsales$price - mean(newsales$price)

# Check that the mean of price_centered = 0
mean(newsales$price_centered) %>% 
  round(10)

Use the new price_centered variable in the model:

# refit model
lm(sales ~ price_centered + display + coupon, data = newsales) %>% 
  summary

Notice that the value of the intercept has changed dramatically in this model while the coefficients and $R^2$ are identical. This illustrates an important point about centering. It does not change the fit of a model, but instead merely changes the location of zero, as a convenience for interpreting the intercept. The model is not better, only more interpretable if 0 had not previously been within the range of the data. Here is a plot of sales ~ price compared to sales ~ price_centered:

# Visualize sales ~ price_centered
plot1 <- ggplot(newsales, aes(price, sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  theme_minimal() +
  labs(title = "Sales ~ price")

plot2 <- ggplot(newsales, aes(price_centered, sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  theme_minimal() +
  labs(title = "Sales ~ price_centered")

grid.arrange(plot1, plot2)

Exactly the same plot, but with shifted values on the x-axis. Now the intercept, 79.47, represents fitted sales (in 100s) when the numeric inputs are 0 and the categorical inputs are at the reference level---when price is average and there are no coupons or display promotions.

Coefficients

price: an increase of 1 unit in price is associated with a -75.39 change in the fitted value of sales, while holding the other variables fixed. This coefficient, as we can see from the p-value (essentially 0), is statistically different from 0, from which we can conclude that the observed value is unlikely to have occurred by chance.
display (or, more precisely, displayyes): an increase of 1 unit in display, from no to yes, is associated with 72.55 change, on average, in the fitted value of sales, while holding the other variables fixed. Another way to think about this coefficient is that it represents the difference in fitted sales between having a display and not having a display. The coefficient, in other words, represents a comparison to the reference level, which is no display. R automatically includes in the name of the coefficient the factor level above the reference level, here yes. The way to read this: if we change display to yes then we expect an increase of 72.55 in sales. (If the factor levels in this variable were reversed, with yes as the reference level, then the variable name would read displayno and the coefficient would have the opposite sign.)
coupon (or couponyes): an increase of 1 unit in coupon, from no to yes, is associated with a 212.40 change, on average, in the fitted value of sales, while holding the other variables fixed. The coefficient represents the difference in fitted sales between a coupon promotion and the reference level: no coupon promotion.

Effect Sizes and Causality

An effect size is simply the absolute value of the coefficient estimate. The larger the effect size (in absolute terms) the stronger the relationship between the predictor and outcome. (Remember: the p-value is not an effect size, but simply a measure of how unlikely the observed coefficient would be if the null hypothesis were true.) In this instance we can see that coupon had by far the largest effect on sales, after taking into account price and display as other possible explanations of sales. Usually regression results are regarded as merely uncovering correlations or associations, with no implication of causality (you may have heard the phrase: "correlation is not causation"). However, assuming that we are not missing other explanations of sales, such as seasonality or holidays, we might think about using these modeling results to make a causal argument for the impact of coupon promotions on beer sales. What would be required? To establish that $x$ causes $y$ would require additional pieces of evidence beyond a statistically significant coefficient for $x$. Minimally:

there must be covariation between $x$ and $y$ (look for a statistically significant $\beta$ coefficient for x).
there must be a time-ordered relationship between $x$ and $y$: first $x$ then $y$.
the researcher must eliminate plausible alternative explanations and make an argument.

If we were prepared to argue these points, we could say that a coupon promotion will produce, on average, 212 more units sold. Notice that this language---"will produce"---is causal.

Rather than expressing effect sizes as point estimates, it is more realistic to report confidence intervals. The method, as we've seen, is to add to and subtract from the point estimate approximately two standard errors. In the case of coupon the calculation would be:

(lower <- 212.4 - 2 * 16.94)
(upper <- 218.4 + 2 * 16.94)

Centering and Scaling

Centering an input, as we've seen, can make the intercept interpretable when 0 is not observed in the data for continuous variables. Centering is thus an interpretive convenience. Similarly, scaling a continuous input variable can be an interpretive convenience when variables have very different ranges. For example, consider two numeric variables, one with a range of [0, 1] and another with a range of [0, 100]. In the first case a one-unit increase would go from the minimum to the maximum, while in the second it would represent just a small fraction of the range. In such situations comparing coefficients to determine relative effect sizes can be difficult.

How do we scale a continuous input variable? Simply divide each observation by one standard deviation. This standardizes the variables, putting them on roughly the same scale. 99% of the scaled observations will be in the range [-3, 3]. Note: this procedure is restricted to continuous variables; you would not want to scale a dummy variable. Here is an example of centering and scaling price:

# Center and scale by hand, dividing by 1 SD
((newsales$price - mean(newsales$price)) /  sd(newsales$price)) %>% 
  head

# Use base R function, scale()
scale(newsales$price) %>% head

Some literature suggests that dividing by 2 standard deviations is better because it makes the newly scaled variables more comparable to the binary scale of dummy variables or factor variables.

# Center and scale by hand, dividing by 2 SD
((newsales$price - mean(newsales$price)) / (2 * sd(newsales$price))) %>% 
  head

# Use rescale() in arm package
library(arm)
rescale(newsales$price) %>% head

Again, why center and scale? When input variables are on substantially different scales it can be hard to compare effect sizes, represented in regression coefficients, and effect sizes (rather than p-values) are what we use to reason about relationships in our data. To solve this problem, we need to rescale continuous variables so that unit changes are comparable. Rescaling in the case of price won't make much difference since the range of the variable is already quite small. In other cases, it will make a huge difference.

Here is a made up example to illustrate the point:

# Rescaling example

set.seed(123)

# Create coefficients
a <- 1
b1 <- 2
b2 <- 10

# Create predictors and error
x1 <- runif(n = 1000, min = 1, max = 1000) # range 1:1000
x2 <- rbinom(n = 1000, size = 1, prob = .5) # range 0:1
error <- rnorm(n = 1000, mean = 0, sd = 10)

# Create outcome
y <- a + b1 * x1 + b2 * x2 + error

# Save as a data frame
rescale_data <- data.frame(x1 = x1,  
                           x2 = x2,
                           y = y)

# Look at rescale example data
rescale_data %>% 
  head

In this example x1 is a continuous predictor, with a range between 1 and 1000, while x2 is binary. Here is the regression:

# Run regression with unscaled inputs
lm(y ~ x1 + x2, rescale_data) %>%  
  summary

The regression output tells us that x2 has a much larger effect size than x1. But this is misleading. A coefficient represents the expected change in $y$ associated with a one unit change in the predictor. One unit is 100% of the range of x2 but just .1% of the range of x1. Look what happens when we center and scale:

# Run regression with centered and scaled continuous predictor
lm(y ~ rescale(x1) + x2, rescale_data) %>%  
  summary

Put on comparable scale by standardizing, x1 turns out to have a much larger effect size than x2, exactly the reverse of what we initially thought.

When should we center and scale?

To prepare data for machine learning. Most machine learning algorithms require inputs to have similar ranges.
To compare effect sizes of coefficients when input variables are on different scales. Note that scaled inputs are easy to compare but hard to interpret. After scaling, 1 unit is measured in terms of standard deviations, which is not intuitive. The above coefficient for x1 would be interpreted as the change in y associated with a 2 standard deviation change in x1, when x2 is held fixed.

When should we not center and scale?

Centering and scaling breaks the relationship with the original units. If the goal is to talk about the effect size of a predictor in terms of the original units---say, dollars or years---then we should leave the inputs unscaled.

Model Improvement

Polynomial Regression

Earlier we observed that the relationship between price and sales was curved. Non-linear relationships can be captured using linear regression in a variety of ways:

log transforming the outcome variable
adding terms with exponents (polynomial regression)
adding interactions between predictors.

Let's introduce polynomial regression by fitting a model with a quadratic term for price. Perhaps the easiest way to do this is to use the I() function, which allows us to alter the variable on the fly, like this:

# Fit the polynomial regression
(poly <- lm(sales ~ price + I(price^2) + coupon + display,
   data = newsales)) %>% 
  summary

Is this model with quadratic price better? Use the LRT to compare models:

# Compare it to the reference model
anova(new_multiple, poly)

Adding the quadratic term improves the model: $R^2$ has gone up (slightly) and residual standard error has gone down. Note:

Whenever adding a higher order term all the lower order terms need to be included in the model. Hence price also should be included along with the quadratic term.
It is possible (though I think somewhat awkward) to add a quadratic variable, consisting of price x price, directly to the dataset. I prefer to accomplish the same thing while fitting the model with I(price^2).

Notice that both the coefficient and standard error for price get quite large after including the quadratic term. There are complicated reasons for this, related to the multicollinearity between price and price^2. Multicollinearity occurs when two predictors are strongly correlated but is only a problem for model interpretation; it does not affect model fit or predictive uses of the model. The problem could be addressed in this case by centering and scaling the variable with the quadratic term, which will return the coefficients and standard errors to interpretable quantities.

In this model the coefficient for the quadratic term is negative, meaning that the resulting parabola is facing downwards. Here is an illustration of the situation:

newd <- newsales

newd$fitted <- lm(sales ~ price + I(price^2), data = newd) %>%  fitted

ggplot(newd, aes(price, sales)) + 
  geom_point()+
  labs(title = "sales ~ price + price^2")+
  theme_minimal()+
  geom_line(aes(price, fitted), col = 2)

While it is certainly possible to add higher order terms than quadratic to a model, doing so risks creating an overly complicated model, and should be avoided unless there is a strong rationale or explanation for it.

Interactions

As noted above, interactions are another way of fitting non-linear data. In this case our knowledge of the business context is admittedly limited, but we might nevertheless speculate that the relationship between price and sales will differ by whether there was also a coupon promotion. In other words, coupon promotion might systematically change the relationship between price and sales. This situation--- where a relationship between the predictor and the outcome differs by the level of a third variable--- is called an interaction, and is an extremely powerful modeling technique, both for improving model fit and for explaining relationships in the data.

To create an interaction use * rather than + to connect the terms in the model formula: sales ~ price * coupon. In fitting this model, however, we will also center price using the scale() function to make the interpretation of model coefficients simpler. (This is usually the case: the coefficients in an interaction model are more meaningful when the continuous inputs are centered.) scale() has two arguments, center and scale, both of which are set to T by default; the function will perform centering only if we explicitly turn off scaling: scale(x, scale = F). In this model we will center, but not scale, price.

# Fit an interaction model
lm(sales ~ scale(price, scale = F)  * coupon,
   data = newsales) %>% 
  summary

The interaction term is not quite significant, meaning that the relationship between price and sales does not clearly depend on coupon. Here is a picture of the situation. Plotting interactions aids in understanding them!

ggplot(newsales, aes(price, sales, col=coupon))+
  geom_point()+
  geom_smooth(method = "lm", se = F)+
  labs(title = "sales ~ price * coupon") +
  theme_minimal()

Since the regression lines are not parallel (or even close to parallel), the sales ~ price relationship seems to depends on whether coupon is no or yes. But that conclusion appears to be statistically uncertain, despite the fairly large effect size. The standard error for the interaction term is quite large.

Interpreting the coefficients is tricky in a model with interactions.

Intercept: 102.41. This is the predicted value of the outcome, sales, when the predictor variables are either 0 (if continuous) or at the reference level (if categorical, as in this case). Here the model predicts sales of 102.4, on average, when price is 0 (or average for the centered variable) and when coupon is at its reference level of 0 (when there is no coupon promotion).
scale(price, scale = F): -103.35. In a model with an interaction, the coefficient for an interacted variable by itself is known as a "main effect." Main effects in interaction models are conditional on the variable with which they are interacted. In this case, the coefficient for centered price is the predicted average change in sales associated with a 1-unit change in price when coupon is at its reference level of 0.
coupon: 215.38. This is also a main effect. The coefficient for coupon is the predicted average change in sales associated with changing coupon from it reference level of 0 to one when centered price is 0 (or average). Moreover, we can use the output to estimate sales at average price for a coupon promotion: 102.41 + 215.38.
scale(price, scale = F):coupon: 90.58. The coefficient in this interaction represents the difference in slope between the regression line of sales ~ price when coupon is no and when coupon is yes. In other words, the relationship between sales and price gets stronger---the regression line becomes more positive, steeper---when coupon = yes compared to coupon = no. This relationship is clear in the plot. Specifically, an increase of 1 unit in price is predicted to add, on average, 90.58 more sales when coupon = yes than when coupon = no.

This model is able to capture significant non-linearity in the data:

# Fit linear model using train()
caret_lm <- lm(sales ~ price + price2 + display + coupon,
                  data = newsales,
                  method = "lm")

pd <- newsales %>% 
  mutate(fitted_lm = predict(caret_lm))

ggplot(pd, aes(price, sales))+
  geom_point()+
  geom_line(data=pd, aes(price, fitted_lm), col=2)+
  labs(title = "Linear market response model",
       subtitle= "Fitted values in red")+
  theme_minimal()